Automated Detection of Health Websites' HONcode Conformity: Can N-gram Tokenization Replace Stemming?

Authors

  • Célia Boyer
  • Ljiljana Dolamic
  • Natalia Grabar
Abstract

The authors evaluated supervised classification algorithms for automatically determining whether health-related web pages comply with individual HONcode criteria of conduct, representing each page as a vector of character n-grams of varying length. The training and testing collection comprised web page fragments extracted by HONcode experts during the manual certification process. Using a Naive Bayes classifier with document frequency (DF) as the dimensionality reduction metric, the authors compared classification performance with n-gram tokenization against performance with document words and with Porter-stemmed document words. The study sought to determine whether this automated, language-independent approach might safely replace word-based classification. With 5-grams as document features, the authors also compared the baseline DF reduction function to Chi-square and Z-score dimensionality reduction. Overall, the results indicate that n-gram tokenization is a potentially viable alternative to word stemming.
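The pipeline described in the abstract (character n-gram features, DF-based feature reduction, Naive Bayes classification) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function and class names, the cutoff `k`, the Laplace smoothing, and the toy fragments and labels are all assumptions made for illustration, whereas the study used fragments extracted by HONcode experts.

```python
import math
from collections import Counter

def char_ngrams(text, n=5):
    """Overlapping character n-grams over lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def select_by_df(docs_tokens, k):
    """Keep the k features that occur in the most documents (document frequency)."""
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each feature once per document
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))  # ties: alphabetical
    return {feat for feat, _ in ranked[:k]}

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs_tokens, labels, vocab):
        self.vocab = vocab
        self.classes = sorted(set(labels))
        self.counts = {c: Counter() for c in self.classes}
        for tokens, y in zip(docs_tokens, labels):
            self.counts[y].update(t for t in tokens if t in vocab)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        return self

    def predict(self, tokens):
        v = len(self.vocab)

        def score(c):
            s = self.prior[c]
            for t in tokens:
                if t in self.vocab:  # features outside the reduced vocabulary are ignored
                    s += math.log((self.counts[c][t] + 1) / (self.totals[c] + v))
            return s

        return max(self.classes, key=score)

# Invented toy fragments and labels (assumption: not data from the study).
train = [
    ("We never share your personal data with third parties.", "privacy"),
    ("Your email address is kept confidential and private.", "privacy"),
    ("This site is funded by a grant from the university.", "funding"),
    ("Funding for this website comes from private donations.", "funding"),
]
docs = [char_ngrams(text) for text, _ in train]
labels = [label for _, label in train]

vocab = select_by_df(docs, k=150)  # k is an arbitrary illustrative cutoff
model = NaiveBayes().fit(docs, labels, vocab)
print(model.predict(char_ngrams("Your personal data is confidential")))
```

Because the features are raw character windows, the sketch needs no language-specific stemmer or word tokenizer, which is the property the study set out to test against Porter stemming.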


Similar resources

Feasibility of automated detection of HONcode conformity for health-related websites

In this paper, authors evaluate machine learning algorithms to detect the trustworthiness of a website according to HONcode criteria of conduct (detailed in paper). To derive a baseline, we evaluated a Naive Bayes algorithm, using single words as features. We compared the baseline algorithm’s performance to that of the same algorithm employing different feature types, and to the SVM algorithm. ...

Effect of the Named Entity Recognition and Sliding Window on the HONcode Automated Detection of HONcode Criteria for Mass Health Online Content

The Health On the Net’s Foundation (HON) Code of Conduct, HONcode, is the oldest and the most used ethical and trustworthy code for medical and health related information available on the Internet. Until recently, websites voluntarily applying for the HONcode seal were evaluated manually by an expert medical team according to 8 principles, referred to as criteria, and associated published guide...

Generation, Implementation and Appraisal of an N-gram based Stemming Algorithm

A language independent stemmer has always been looked for. Single N-gram tokenization technique works well, however, it often generates stems that start with intermediate characters, rather than initial ones. We present a novel technique that takes the concept of N-gram stemming one step ahead and compare our method with an established algorithm in the field, Porter’s Stemmer. Results indicate ...

Retrieval Experiments at Morpho Challenge 2008

Morpho Challenge 2008 hosted an extrinsic evaluation of morphological analysis that explored whether unsupervised morphology induction could benefit information retrieval. This paper presents results in alternative methods for word normalization using test sets from the Cross-Language Evaluation Forum (CLEF) ad-hoc collections. Preliminary results for the Morpho Challenge 2008 evaluation are co...

JHU/APL Experiments in Tokenization and Non-Word Translation

In the past we have conducted experiments that investigate the benefits and peculiarities attendant to alternative methods for tokenization, particularly overlapping character n-grams. This year we continued this line of work and report new findings reaffirming that the judicious use of n-grams can lead to performance surpassing that of word-based tokenization. In particular we examined: the re...

Journal title:
  • Studies in health technology and informatics

Volume 216


Publication date: 2015